
Week 8 - Support Vector Machines

Dr. David Elliott

  1. Support Vector Classifier (SVC)

  2. Support Vector Machine (SVM)

Common Notation

Notes

We can't always perfectly separate the data with a $p − 1$ dimensional hyperplane. To overcome this problem we could either:

1.4. Support Vector Classifier (SVC)

SVCs are a generalisation and extension of the maximal margin classifier, so they can be applied to a broader range of cases$^1$.

In practice they are more robust to individual observations and classify most training observations better than the maximal margin classifier. This is because they take the approach that it is better to misclassify some training examples in order to do a better job classifying the rest.

This is called a soft margin because it allows a small subset of training observations to violate the margin; these observations may end up not only on the wrong side of the margin, but even on the wrong side of the hyperplane.

Notes

We want to relax the following constraints when necessary:

$$ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \geq 1 \text{ for } y_i = 1, \\ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \leq -1 \text{ for } y_i = -1 $$

This can be done by introducing positive slack variables $\xi_i, i = 1, \ldots, n$ in the constraints$^{5,6,10}$:

$$ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \geq 1 - \xi_i \quad \text{if} \quad y_i = 1, \\ \mathbf{w}^{\mathrm T}\mathbf{x}_i + b \leq -1 + \xi_i \quad \text{if} \quad y_i = -1, \\ \xi_i \geq 0 \quad \forall_i. $$

Notes

Tuning Parameter (C)

To ensure there is a penalty, $C$, for relaxing the constraints, we can change our objective function to be minimised from $\frac{1}{2}||\mathbf{w}||^2$ to,

$$ \begin{align*} {\text{minimise} \atop \mathbf{w}, b, \xi } & \quad \frac{1}{2}||\mathbf{w}||^2+C\sum\limits_{i=1}^n\xi_i, \\ \text{subject to} & \quad y_i(\mathbf{w}^{\mathrm T}\mathbf{x}_i+b) \geq 1-\xi_i, \quad \xi_i \geq 0, \quad \forall_i. \end{align*} $$

$C$ is a tuning parameter that controls the bias-variance trade-off$^1$.

The strength of the regularization is inversely proportional to $C$: a large $C$ imposes a larger penalty on margin violations.
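As a rough illustration (a sketch assuming scikit-learn, whose `SVC` exposes this $C$ directly; the blob dataset and $C$ values are purely illustrative), the number of support vectors tends to shrink as $C$ grows:

```python
# Sketch: effect of the tuning parameter C on a soft-margin SVC
# (assumes scikit-learn; dataset and C values are illustrative only).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

n_sv = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates more margin violations, so more training points
    # end up as support vectors; a large C penalises slack heavily.
    n_sv[C] = int(clf.n_support_.sum())

print(n_sv)
```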

Extra

Neither the $\xi_i$ nor their Lagrange multipliers appear in the Wolfe dual problem. This means we now have$^6$:

$$ \max L_D \equiv \sum_i^n\alpha_i - \frac{1}{2}\sum^n_{i,k}\alpha_i\alpha_ky_iy_k\mathbf{x}_i\cdot \mathbf{x}_k \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i\alpha_iy_i = 0. $$

This also has the same solution as before:

$$ \mathbf{\hat w} = \sum\limits^{N_S}_{i=1}\alpha_iy_i\mathbf{x}_i. $$
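As a sketch of this relationship (assuming scikit-learn, whose fitted `SVC` stores the products $\alpha_i y_i$ in `dual_coef_`; the data are illustrative), we can recover $\hat{\mathbf w}$ from the support vectors and compare it with the fitted primal coefficients:

```python
# Sketch: the dual solution w = sum_i alpha_i y_i x_i, recovered from a
# fitted linear SVC (assumes scikit-learn; data are illustrative).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the primal
# weight vector is its product with the support vectors themselves.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # the two should agree
```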

Also, sometimes $C$ is defined as $C = \frac{1}{\nu N}$, where $0 < \nu \leq 1$ controls the fraction of misclassified points during the training phase$^7$.

Notes

1.5. Support Vector Machine (SVM)

Aims to address the situation where the boundary between two classes is not linear.

Feature Engineering

We could consider enlarging the feature space to make the dataset linearly separable.

Example: We can see below that our $\mathbf{x}_1$ is not linearly separable, but it becomes separable once we add a second feature $\mathbf{x}_2 = (\mathbf{x}_1)^2$.

Using quadratic, cubic or higher-order polynomial functions we can project our data onto a higher-dimensional space via a mapping function $\phi$ where they are linearly separable (using a linear SVM model in this new feature space).

Example

$\phi(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{z}_1,\mathbf{z}_2,\mathbf{z}_3) = (\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}^2_1+\mathbf{x}^2_2)$
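A minimal sketch of this idea (assuming scikit-learn; the concentric-circles dataset is illustrative): lifting 2-D circle data with the map above makes it linearly separable, so a linear SVC fits it perfectly in the new space.

```python
# Sketch: lifting 2-D circle data with phi(x1, x2) = (x1, x2, x1^2 + x2^2)
# makes it linearly separable (assumes scikit-learn; purely illustrative).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
Z = np.c_[X, (X ** 2).sum(axis=1)]  # append the third feature x1^2 + x2^2

clf = SVC(kernel="linear", C=1.0).fit(Z, y)
acc = clf.score(Z, y)
print(acc)  # a linear boundary now separates the classes
```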

Gaussian Radial Basis Function$^2$

We could instead use a "similarity function", such as a Gaussian Radial Basis Function (RBF),

$\phi_\gamma(\mathbf{x},\ell) = \exp(-\gamma||\mathbf{x}-\ell||^2)$.

This is a bell-shaped function which measures how much an instance resembles a landmark, with the function varying from 0 (far away) to 1 (at the landmark).

Example: Below we set our landmarks to $x_1 = -2$ and $x_1 = 1$.

Using the example of $x_1=-1$ we can see it is a distance of 1 from the first landmark and 2 from the second.

If we set $\gamma = 0.3$ then our new features are:

$x_2 = \exp(-0.3 \times 1^2) \approx 0.74$

$x_3 = \exp(-0.3 \times 2^2) \approx 0.30$
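The arithmetic above can be checked directly (plain Python, no extra libraries; the values mirror the worked example):

```python
# Worked check of the landmark example: x1 = -1, landmarks at -2 and 1,
# gamma = 0.3.
from math import exp

gamma, x = 0.3, -1.0
landmarks = (-2.0, 1.0)

# Each landmark contributes one new RBF similarity feature.
features = [exp(-gamma * (x - ell) ** 2) for ell in landmarks]
print([round(f, 2) for f in features])  # → [0.74, 0.3]
```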

To choose the landmarks, the simplest approach is just to create a landmark at each instance in the dataset.

Kernels

However, enlarging our feature space through feature engineering comes at a cost: the larger the number of features, the higher the computational burden.

Instead, it is common to enlarge the feature space using an extension of the SVC termed the Support Vector Machine, which uses kernels.

The kernel trick relies on the fact that we can define our SVM in terms of inner products.

$$ L_D(\alpha_i) = \sum_i^n\alpha_i - \frac{1}{2}\sum_{i,k}^n\alpha_i\alpha_ky_iy_k\mathbf{x}_i^{\mathrm T}\mathbf{x}_k \qquad \text{s.t.} \quad \forall_i \alpha_i \geq 0, \ \sum_i^n\alpha_iy_i = 0. $$

Imagine we had a mapping $\phi$ which maps the data to some high dimensional Euclidean space

$$ \phi: \mathbb{R}^d \mapsto H, $$

then, we could do dot products between vectors after the mapping in $H$:

$$ L_D(\alpha_i) = \sum_i^n\alpha_i - \frac{1}{2}\sum_{i,k}^n\alpha_i\alpha_ky_iy_k\phi(\mathbf{x}_i)^{\mathrm T}\phi(\mathbf{x}_k) $$

Instead we could use a kernel function,

$$ K(\mathbf{x}_i,\mathbf{x}_k) = \phi(\mathbf{x}_i)^{\mathrm T}\phi(\mathbf{x}_k). $$

Notes

Second-Degree Polynomial Example$^2$

Suppose we wanted to apply a second-degree polynomial transformation to two 2D vectors, $\mathbf{a}$ and $\mathbf{b}$, and then compute the dot product. We could do this by:

$$ \phi(\mathbf{a})^{\mathrm T}\phi(\mathbf{b}) = \begin{pmatrix} {a_1}^2 \\ \sqrt{2}a_1a_2 \\ {a_2}^2 \end{pmatrix}^{\mathrm T} \begin{pmatrix} {b_1}^2 \\ \sqrt{2}b_1b_2 \\ {b_2}^2\end{pmatrix} = {a_1}^2{b_1}^2+2a_1b_1a_2b_2+{a_2}^2{b_2}^2. $$

Instead we could use the kernel approach:

$$ K(\mathbf{a}, \mathbf{b}) = (\mathbf{a}^{\mathrm T}\mathbf{b})^2 = \begin{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}^{\mathrm T}\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}\end{pmatrix}^2 = (a_1b_1+a_2b_2)^2 = {a_1}^2{b_1}^2+2a_1b_1a_2b_2+{a_2}^2{b_2}^2. $$

This is useful because it means we didn't have to map our data using $\phi$ first. This saves us time!
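The identity above is easy to confirm numerically (numpy only; the vectors are arbitrary): the explicit map followed by a dot product gives the same number as squaring the plain dot product.

```python
# Sketch: confirming K(a, b) = (a.T b)^2 equals the dot product after the
# explicit second-degree map (numpy only; values are arbitrary).
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])

def phi(v):
    # explicit second-degree polynomial feature map
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

explicit = phi(a) @ phi(b)    # map first, then take the dot product
kernel = (a @ b) ** 2         # kernel trick: dot product, then square
print(explicit, kernel)       # the two quantities are equal
```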

Notes

$$ \phi(\mathbf{x}) = \phi((x_1, x_2)) = ({x_1}^2, \sqrt{2}x_1x_2, {x_2}^2) $$

We still have all the same considerations, but replacing $\mathbf{x}_i^{\mathrm T} \mathbf{x}_k$ with $K(\mathbf{x}_i, \mathbf{x}_k)$ allows us to produce a SVM in infinite dimensional space.

$$ \sum_i^n\alpha_i - \frac{1}{2}\sum_i^n\sum_k^n\alpha_i\alpha_ky_iy_kK(\mathbf{x}_i, \mathbf{x}_k) \qquad \text{s.t.} \quad \forall_i \alpha_i \geq 0, \ \sum_i^n\alpha_iy_i = 0. $$

In the test phase, we can use the support vectors$^6$:

$$ \begin{align*} f(\mathbf{x}^*) &= \sum_{i \in s}\hat\alpha_iy_i\phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}^*) + \hat b \\ &= \sum_{i \in s}\hat\alpha_iy_iK(\mathbf{x}_i,\mathbf{x}^*) +\hat b, \end{align*} $$

avoiding computing $\phi(\mathbf{x}^*)$.
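As a sketch (assuming scikit-learn, whose fitted `SVC` exposes `dual_coef_` $= \hat\alpha_i y_i$, `support_vectors_` and `intercept_`; the data are illustrative), the test-phase formula above can be reproduced by hand and compared with the library's decision function:

```python
# Sketch: reproducing f(x*) = sum_{i in S} alpha_i y_i K(x_i, x*) + b
# from a fitted RBF SVC (assumes scikit-learn; data are illustrative).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_star = X[:5]  # a few points standing in for test observations
K = rbf_kernel(clf.support_vectors_, x_star, gamma=0.5)
# dual_coef_ already stores alpha_i * y_i for each support vector
f = clf.dual_coef_ @ K + clf.intercept_

print(np.allclose(f.ravel(), clf.decision_function(x_star)))
```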

Note

$$ \begin{align*} f(\mathbf{x}^*) &= \sum_{i = 1}^{n_s}\hat\alpha_iy_i\phi(\mathbf{s}_i)\cdot \phi(\mathbf{x}^*) + \hat b \\ &= \sum_{i = 1}^{n_s}\hat\alpha_iy_iK(\mathbf{s}_i,\mathbf{x}^*) +\hat b. \end{align*} $$

Polynomial kernels

The polynomial kernel of degree $d$ (a positive integer) can be defined as:

$$K(\mathbf{x}_i,\mathbf{x}_k) = \left(\gamma \left<\mathbf{x}_i, \mathbf{x}_k\right> + r \right)^d$$
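In scikit-learn's `SVC` these quantities map onto the parameters `gamma` ($\gamma$), `coef0` ($r$) and `degree` ($d$); a minimal sketch (the dataset and values are illustrative):

```python
# Sketch: the polynomial kernel's gamma, r (coef0) and d (degree) as
# exposed by scikit-learn's SVC (data and values are illustrative).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# K(x_i, x_k) = (gamma * <x_i, x_k> + coef0) ** degree
clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
acc = clf.score(X, y)
print(acc)
```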

Hyperparameters

Notes

You shouldn't stick to the default settings; instead, search for the optimal hyperparameters for the data. Good suggestions for search spaces are:

Radial Basis Function kernel

The most widely used kernel is the RBF kernel (also known as a Gaussian kernel)$^{4,8}$:

$$K(\mathbf{x}_i,\mathbf{x}_k) = \exp\left(-\gamma||\mathbf{x}_i-\mathbf{x}_k||^2\right),$$

where $\gamma$ is a free parameter to be optimised and $||\mathbf{x}_i-\mathbf{x}_k||^2$ is the squared Euclidean distance.

$\gamma$ is often either a positive constant or $\frac{1}{2\sigma^2}$.

When classifying a test observation $\mathbf{x}^* = (x^*_1...x^*_p)$, only training observations close to $\mathbf{x}^*$ (in terms of Euclidean distance) play a role in its class label. This is because, for distant training observations, $\sum^p_{j=1}(x^*_j-x_{ij})^2$ will be large, so $\exp(-\gamma\sum^p_{j=1}(x^*_j-x_{ij})^2)$ will be tiny$^1$.
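This locality is easy to see numerically (numpy only; $\gamma$ and the points are illustrative): the kernel value for a nearby point stays close to 1, while a distant point contributes almost nothing.

```python
# Sketch: the RBF kernel value decays with squared Euclidean distance,
# so distant training points contribute almost nothing (numpy only;
# gamma and the points are illustrative).
import numpy as np

def rbf(xi, xk, gamma=1.0):
    # K(xi, xk) = exp(-gamma * ||xi - xk||^2)
    return np.exp(-gamma * np.sum((xi - xk) ** 2))

x_star = np.array([0.0, 0.0])
near, far = np.array([0.5, 0.0]), np.array([3.0, 0.0])

print(rbf(x_star, near), rbf(x_star, far))  # near is ~0.78, far is tiny
```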

Notes

Euclidean Distance$^8$

$$d(\mathbf{x}_i, \mathbf{x}_k) \stackrel{\text{def}}{=} \sqrt{\left(x_{i1}-x_{k1}\right)^2+\left(x_{i2}-x_{k2}\right)^2 + \ldots + \left(x_{iD}-x_{kD}\right)^2} = \sqrt{\sum_{j=1}^D\left(x_{ij}-x_{kj}\right)^2}$$

Hyperparameters$^2$

$\gamma$ effectively acts like a regularization hyperparameter: like $C$, if your model is overfitting reduce it, and if it is underfitting increase it.

$C$ and $\gamma$ are tightly coupled.

Generally, if you have a larger (narrower-bell) gamma, e.g. $\gamma = 5$, you'll need a larger $C$.

If you have a smaller (wider-bell) gamma, e.g. $\gamma = 1$, a smaller value of $C$ should be used$^7$.
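Because the two are coupled, a common approach is to search them jointly; a sketch (assuming scikit-learn's `GridSearchCV`; the grid and dataset are illustrative only):

```python
# Sketch: searching C and gamma jointly with a grid search, since the two
# are tightly coupled (assumes scikit-learn; grid and data illustrative).
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 5]},
    cv=5,  # 5-fold cross-validation for each (C, gamma) pair
)
grid.fit(X, y)
print(grid.best_params_)
```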

Extra Example

Let's have a look at the decision boundaries for our "optimal hyperparameters" as a comparison to those of the linear models we saw earlier.

Standardization$^{11}$

SVMs, along with many other machine learning estimators (e.g. l1 and l2 regularizers of linear models), are sensitive to feature scales.

If a feature has a variance orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly.

Notes

Standardization is especially important for RBF kernels.

These models assume that all features look like standard normally distributed data: Gaussian with zero mean and unit variance.

Dataset Example: Breast Cancer

We can see below for example that the features in the Breast Cancer dataset are of completely different orders of magnitude$^{16}$.
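A sketch of the effect on this dataset (assuming scikit-learn, which ships the Breast Cancer data and a `StandardScaler`; training accuracy is used purely for illustration):

```python
# Sketch: standardizing features before an RBF SVC via a Pipeline
# (assumes scikit-learn; training accuracy shown for illustration only).
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
unscaled = SVC(kernel="rbf").fit(X, y)

s_scaled = scaled.score(X, y)
s_raw = unscaled.score(X, y)
print(s_scaled, s_raw)  # scaling typically helps on this dataset
```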

Notes

Recap$^1$

The kernel approach is an efficient computational approach to enlarge our feature space to accommodate a non-linear boundary.

Assume we have a new point $x^*$. If we wanted to compute $f(x^*)$ using our linear classifier, we would need the inner product between $x^*$ and each training point $x_i$:

$$f(x^*) = \sum_{i\in s}\alpha_i \left<x^*,x_i\right> + b.$$

Instead of actually calculating the inner product, we could use a generalisation, $K(x^*,x_i)$, where $K$ is a kernel. We can now define the classifier as:

$$f(x^*) = \sum_{i\in s}\alpha_i K\left(x^*,x_i\right) + b.$$

A kernel is a function that quantifies the similarity of two observations. For example, for a linear kernel we could use:

$$K(x_i, x_{i^\prime}) = \sum^p_{j=1}x_{ij}x_{{i^\prime}j},$$

where we quantify the similarity of pairs of observations using Pearson (standard) correlation.

However, we could use other forms of kernel to fit the support vector classifier in a higher-dimensional space, such as a polynomial kernel:

$$K(x_i, x_{i^\prime}) = \left(\gamma \sum^p_{j=1}x_{ij}x_{{i^\prime}j} + r\right)^d.$$

Another popular choice is the radial kernel:

$$K(x_i, x_{i^\prime}) = \exp\left(-\gamma\sum^p_{j=1}(x_{ij}-x_{{i^\prime}j})^2\right).$$

Associated Exercises

Now might be a good time to try exercises 5-8.

References

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
  2. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
  3. Zheng, A., & Casari, A. (2018). Feature engineering for machine learning: principles and techniques for data scientists. O'Reilly Media, Inc.
  4. Raschka, 2016
  5. Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20:273–297, 1995
  6. Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
  7. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  8. Burkov, A. (2019). The hundred-page machine learning book (Vol. 1). Canada: Andriy Burkov.
  9. Zhang, J. (2015). A complete list of kernels used in support vector machines. Biochem. Pharmacol.(Los Angel), 4, 2167-0501.
  10. Raschka, S., & Mirjalili, V. (2017). Python machine learning: machine learning and deep learning with Python, Scikit-Learn, and TensorFlow (2nd ed.). Packt Publishing.
  11. https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
  12. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
  13. Al-Mejibli, I. S., Alwan, J. K., & Abd Dhafar, H. (2020). The effect of gamma value on support vector machine performance with different kernels. International Journal of Electrical and Computer Engineering, 10(5), 5497.